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Gene expression programs have been found to be highly conserved between 
closely related species, especially when comparing the same tissue types between 
species. Such analysis is, however, much more challenging over larger evolutionary 
distances when complementary tissues cannot readily be defined. Here, we present 
the first cross-species mapping of tissue-specific and developmental gene 
expression patterns across a wide range of animals, including many non-model 
species. Importantly, our approach does not require the definition of homologous 
tissues. In our survey of 32 datasets across 23 species, we detected conserved 
expression programs on all taxonomic levels, both within animals and between the 
animals and their closest unicellular relatives, the choanoflagellates. We found that 
the rate of change in tissue expression patterns is a property of gene families. 
Subsequently, we used the conservation of expression programs as a means to 
identify neofunctionalization of gene duplication products. We found 1206 
duplication events where one of the two genes kept the expression program of the 
original gene, whereas the other copy adopted a novel expression program. We 
corroborated such potential neofunctionalizations using independent network 
information: the duplication product with the more conserved expression pattern 
shared more interaction partners with the non-duplicated reference gene than the 
more divergent duplication product. Our findings open new avenues of study for the 
comparison and transfer of knowledge between different species. 
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Introduction 

Gene functions have traditionally been determined using molecular and cellular 
approaches involving forward or reverse genetics. Functional annotations that were 
directly determined through these approaches are, however, not available at all for 
most species, and incomplete even for model species (Thomas et al. 2012). For 
non-model species, often only data transferred from other organisms is available. In 
this case, the degree of conservation of functions is uncertain, especially when a 
gene is duplicated in a non-model species, but not in the model species where its 
function has originally been studied. Previously, gene co-expression data has been 
used to find conserved co-expressed modules (Stuart 2003; Gerstein et al. 2014) 
and to uncover functional similarities between genes from different species (Chikina 
& Troyanskaya 201 1). However, the latter approach requires that the two species 
are well-studied in both gene expression and functional annotation, and will suffer 
from incomplete and biased annotations (Thomas et al. 2012). Developmental gene 
expression profiles between closely related species can be compared to find 
functional links between genes and to detect differences between orthologs (Yanai 
et al. 201 1 ; Levin et al. 201 2; Silver et al. 201 2). Existing approaches require that 
expression datasets have been obtained under comparable conditions for the 
respective species. For closely related species, homologous tissues can easily be 
identified (Niknejad et al. 2012), and cross-species correlations between 
homologous tissues of closely related species have previously been investigated 
(Piasecka, Kutalik, et al. 2012a; Liao & Zhang 2006). This is however a severe 
limitation for functional mapping between many species. Even between closely 
related species, the relative amounts of cell types that make up tissues may change. 
Across larger evolutionary distances, only few clearly homologous tissues are 
available. Nonetheless, it is possible to identify deep homologies among tissues 
(Shubin et al. 2009). For example, homologous structures have been identified in the 
nervous systems of vertebrates and annelids (Tomer et al. 2010; Strausfeld & Hirth 
2013). Other organs show functional convergence, for example mammalian liver and 
brown fat in flies, which both carry out xenobiotic clearance functions (Chung et al. 
2009). 

Many gene expression datasets have been generated under experimental 
conditions that represent non-physiological conditions, such as gene knockouts, 
which are not under evolutionary selection. Such data is therefore not necessarily 
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suitable for comparing gene expression across species (Seok et al. 2013). In 
contrast, the formation of tissues during development and the maintenance of tissue 
function throughout the life of an animal are crucial for survival and reproduction, 
and are therefore under direct evolutionary selection (Winter et al. 2004; Gu & Z. Su 
2007). Tissue expression data is available for many species, as tissues can be 
gathered even from non-model species where genetic tools such as transgenesis or 
RNAi are not available. Previous research has shown that it is possible to predict 
tissue-specific expression patterns from gene expression experiments within the 
same species (Chikina et al. 2009). However, it remains challenging to map tissue 
expression over larger phylogenetic distances. If such mapping was possible, we 
could substantially improve the annotation of non-model-species genomes, fill 
annotation gaps in model species, and in particular address the problem of gene 
duplications. 

We have developed a method to map tissue expression patterns of genes from one 
species to another, without defining equivalent tissues between the two species. For 
each gene of the source species, this approach predicts a virtual tissue expression 
pattern in the destination species. These virtual expression patterns can then be 
compared to the expression patterns of genes in the target species, enabling us to 
calculate correlations between the expression patterns of genes across species. We 
showed that high correlations in tissue expression across species are predictive for 
1 :1 orthology, shared structure, and similar function. Subsequently we used our 
modeling approach for three applications: first, for determining the degree of 
conservation of tissue-specific gene expression patterns, second, for comparing the 
speed of functional divergence between independently evolving members of protein 
families, and third, for analyzing the fate of gene duplication products. 

Results 

Correlation between tissues of distant species 

To analyze tissue expression across the entire metazoan kingdom, we gathered 
genome and tissue expression data from 32 datasets covering 23 different species 
(Table S1). Among these were eight chordate species: Ciona intestinales (Shoguchi 
et al. 201 1), Danio rerio (Domazet-Loso & Tautz 201 0), Gallus gallus (Chan et al. 
2009; Irie & Kuratani 201 1), Homo sapiens (Lukk et al. 201 0), Mus musculus (Irie & 
Kuratani 201 1 ; A. I. Su et al. 2004), Sus scrofa (Freeman et al. 2012), Tetraodon 
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nigroviridis (Chan et al. 2009), and Xenopus tropicalis (Chan et al. 2009; Yanai et al. 
2011); two cnidarians: Hydra vulgaris (Hemmrich et al. 2012) and Nematostella 
vectensis (Tulin et al. 2013); two flatworms: Schistosoma japonicum (Gobert et al. 
2009) and Schistosoma mansoni (Nawaratna et al. 201 1 ; Fitzpatrick et al. 2009); 
three insects: Anopheles gambiae (Baker et al. 201 1 ; Dissanayake et al. 2006; 
Goltsev et al. 2009), Bombyx mori (Xia et al. 2007) and Drosophila melanogaster (St 
Pierre et al. 2014; Robinson et al. 2013); seven nematodes: Ascaris suum (Wang et 
al. 2013), Brugia malayi and five Caenorhabditis species (Levin et al. 2012; Spencer 
et al. 201 1). Furthermore, we added the choanoflagellate Salpingoeca rosetta as an 
outgroup (Fairclough et al. 2013). Many datasets contain both tissues and 
developmental samples, e.g. different adult organs and embryonic stages. For the 
sake of brevity, we refer to the all of these samples as "tissues." 

To determine orthology relations between genes, we assembled groups of orthologs 
(OGs) using the eggNOG pipeline (Powell et al. 2014) on the genomes of the 
choanoflagellate Salpingoeca rosetta and 67 animals. We then computed gene trees 
for all OGs using GIGA (Thomas 201 0), which we then analyzed to extract 1 :1 
orthologs and duplication events. First, we quantified the correlation of gene 
expression between tissues across species. For each pair of datasets, we built gene 
expression vectors for all tissues using the expression patterns of 1 :1 orthologs. 
This yielded one vector of expression values for each tissue. We then calculated the 
correlation of these vectors across species and found that for 89.0% of all dataset 
pairs, more than half of the tissues in one dataset were significantly correlated with 
at least one tissue from the other dataset (using a p-value cutoff of 0.05 for each 
tissue pair). Importantly, this was true even across large phylogenetic distances. For 
example, between fly and C. elegans, the two largest correlations of 0.31 were 
between ovary and gonad, and between head and L2 glutamate receptor neurons. 
When we removed the three worst datasets from the analysis {Nematostella, Hydra 
and Bombyx), the fraction increased to 98.4% of all dataset pairs. Interestingly, for 
70.6% of all dataset pairs in the filtered set, all tissues of one dataset were 
significantly correlated with at least one tissue from the second dataset. These 
correlations suggested that it is feasible to map gene expression patterns between 
tissues of distantly related species, even if a homology relation between the tissues 
is not apparent. 
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Mapping gene expression between species 

To predict tissue expression patterns across species we chose a simple and 
transparent method, namely to train linear models for mapping expression values 
across species (Fig. 1). Given the tissue expression values for a source species, 
each linear model predicted the expression value for one tissue from the target 
species. Thus, for each combination of source and target species, we trained as 
many linear models as there are tissues in the target species. Importantly, this 
modeling approach did not require 1 :1 relationships of tissues (i.e. the existence of 
homologous tissues). Rather, the expression in each tissue of the target species 
was modeled as a combination of the tissues in the source species (see Methods). 



I Expression I 
y patterns y 
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I Keep 1:1 I 
y orthologs ▼ 



I Train models and 
f map expression 




Tissues 



Correlations between 
mapped and actual 
expression pattern 




between 1 :1 orthologs 



Fig. 1: Mapping expression patterns across species. For each tissue in the target species, models 
were trained to predict the tissue-specific gene expression pattern from the expression patterns of 1 :1 
orthologs in a source species. Mapping the expression patterns of all genes created virtual expression 
patterns, which could then be used to compute correlations between the mapped and actual 
expression patterns. 
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correlation between mapped and actual expression pattern 
Fig. 2. Distribution of correlations between mapped and actual expression patterns, (a) When 
mapping expression patterns from fly to C. elegans, correlations between orthologs (green) and 1 :1 
orthologs (blue) were much higher than for background gene pairs (pairs of genes that are not 
homologous to each other, shown in red), (b) Target genes were distributed in bins according to the 
number of genes with similar expression patterns within the target species. Pairs of background genes 
had a higher correlation when there were more genes with similar expression patterns, as is evident 
from the shift towards higher correlations. For this pair of datasets, bins contained between 297 and 
316 one-to-one orthologs, with an average of 305. 

Using the trained model, we mapped all expression values from the source to the 
target species. We then calculated the Pearson correlation between the mapped 
expression values and the actual expression values, for three sets of genes: (1) all 
genes having homologs between the two species, (2) orthologous groups and (3) 1 :1 
orthologs. We restricted the background set (group 1) to genes with homologs to 
exclude lineage-specific genes that were found to have much lower correlations 
than genes with homologs. When analyzing a pair of genes that are 1 :1 orthologs, 
we used expression values predicted by 10-fold cross-validation. From the 
distribution of correlations, we calculated p-values for all pairs of genes using the 
null hypothesis that the compared genes belong to the background and thus are not 
orthologous. During initial tests, we found a strong correlation between these 
p-values and the number of genes with similar expression patterns in the target 
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species (Fig. S1 , dashed lines). We therefore split target genes into bins according 
to the number of target genes with similar expression patterns (Fig, 2b). For each 
bin, we obtain a mapping from correlation to p-values. For a given correlation 
between the mapped expression pattern of the source gene and the expression 
pattern of the target gene, we then calculate an expression distance out of the 
p-values obtained for the adjacent bins (see Methods). Thus, a low expression 
distance indicates that the expression of this gene in a given target species can be 
well predicted using the expression of homologous genes in the source species. 

The mapping success can be measured in different ways. For each pair of datasets, 
we first compared the distribution of correlations for background genes and 1 :1 
orthologs using the Kolmogorov-Smirnov (K-S) test. Controlling for multiple testing 
with the Benjamini-Hochberg method (Benjamini & Hochberg 1995), 77% of all K-S 
p-values were significant (q < 0.05). As K-S p-values are strongly influenced by the 
number of gene pairs, we also computed the fraction of 1 :1 orthologs that can be 
mapped at an expression distance threshold of 0.25 (Fig. 3). This fraction was highly 
correlated with the K-S statistic D (Pearson correlation 0.97), but more intuitive. 
Across all pairs of datasets, the median fraction of 1 :1 orthologs with expression 
distances below 0.25 was 40%, indicating an enrichment of 1 :1 orthologs with well- 
conserved expression patterns. This analysis revealed both an expected enrichment 
for closely related species and unexpectedly high enrichments between very distant 
species, such as between chordates and insects. In general, developmental 
datasets mapped less well to other species than datasets of adult tissues. However, 
this difference could be attributed to the information content of the different 
datasets, which we did not quantify. 
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Fig. 3. Conservation of expression patterns throughout the metazoans and choanoflagellates. 

For all dataset pairs, the fraction of 1 :1 orthologs with expression distances below 0.25 is shown. 
Within clades, this fraction becomes very high and approaches 1 in some cases. When there is no 
enrichment of 1 :1 orthologs towards lower expression distances, the distributions of correlations are 
identical for pairs of background genes and orthologs. In this case, the distribution of expression 
distances is uniform, and the fraction of orthologs with expression distances below 0.25 is 0.25 (see 
Fig. 5). Note that there are some datasets with universally low values. Here, the kinds of measured 
tissues and the quality of the dataset apparently prevented better mapping performance. However, 
some otherwise distant species had a higher than expected fraction of 1 :1 orthologs with well- 
conserved expression patterns. 

Benchmarks 

In order to establish the biological relevance of our expression distance measure, 
we applied benchmarks at three levels, namely sequence, structure, and function. 
On the sequence level, we found that expression distances could be used to decide 
which of the top two BLAST hits for a query protein is the true 1 :1 ortholog of the 
query protein in the target species (Fig. 4A and Fig. S3). On the structural level (Lees 
et al. 2014), expression distance and the number of proteins belonging to a 
structural fold were correlated (Fig. 4B and Fig. S4). That is, structural folds with 
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fewer members, and hence lower functional diversity, were more similar in their 
expression patterns across species. Lastly, on the functional level, we applied the 
phenolog concept (McGary et a\. 2010) to find equivalent phenotypic annotations 
across species. We found that expression distances could be used to predict which 
member of a protein family has been annotated with a phenotype (Fig. 4C and Fig. 
S5). 



Prediction of 1 :1 orthologs from top two BLAST hits 
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Fig. 4. Summary of benchmarking results. For each pair of datasets from different species, the 
performance in different benchmarks has been computed, along with a p-value. In all three 
benchmarks, there was a clear shift of the results relative to the random expectation (black line). For 
details, see Fig. S3, Fig. S4, Fig. S5 and supplementary text. Due to limited structural and functional 
annotations, there was a lower number of dataset pairs for the two lower panels. 
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Conservation of gene expression programs 

At all taxonomic levels, we determined the conservation of the expression patterns 
of 1 :1 orthologs. This data then allowed us to estimate the degree of conservation of 
tissue-specific expression patterns, even between groups of species that do not 
have readily identifiable homologous organs. For all sets of 1 :1 orthologs, we 
computed the median expression distances when mapping across a particular 
taxonomic split (e.g. for vertebrates, we mapped between fish and tetrapods). First, 
we compared the distributions of expression distances to the uniform distribution. 
With the exception of mappings with cnidaria {Nematostella and Hydra), all 
distributions were significantly shifted towards lower p-values (Fig. 5), confirming 
that our approach can predict expression patterns over large evolutionary distances. 
For some clades, the available data was very uneven on the two sides of the 
taxonomic split. For example, at the level of eumetazoa, only two species with few 
tissues were available for cnidarians, whereas most bilaterian species had many 
tissues measured. Thus, expression distances were higher when mapping from 
cnidarians to bilaterians than the other way round. Interestingly, the median 
divergence between animals and the outgroup choanoflagellates was comparable to 
the median divergence between major animal clades, e.g. bilateria. 

When we chose an expression distance cutoff of 0.25 to designate well-conserved 
genes, we found that 77% of all 1 :1 orthologs could be mapped successfully 
between mouse and human (Fig. 6). For larger clades (like vertebrates), we 
computed for each OG the median of all pairwise expression distances between the 
subclades (in this example, tetrapods and fish). Between tetrapods and fish, we 
found that 55% of all OGs have an expression distance below 0.25. Between 
animals and the outgroup choanoflagellate, 32.7% of all 1 :1 orthologs showed 
conserved expression, which is a significant increase over the 25% expected when 
1:1 orthologs behave like background genes (p-value of one-sided binomial test: 
3e-23). Thus, mapping tissue-specific gene expression revealed expression 
programs conserved for 1 billion years. As the median expression similarities were 
negatively influenced by datasets of low quality or small size, we also computed the 
distributions of expression distances and the number of well-conserved OGs for the 
best dataset pair across each taxonomic split (Fig. S6 and Fig. S7). 
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0.00 0.25 0.50 0.75 



Fig. 5: Distribution of median conserved expression. For each clade, the distribution of expression 
distances of 1:1 orthologs is shown. Red and blue colors denote the direction of the mapping, either 
from the first subclade to the second or vice versa. The black bar corresponds to the expression 
distance cutoff of 0.25 (Fig. 6). When the mapping is successful, our mapping procedure yields virtual 
expression patterns of 1 :1 orthologs that are very similar to the actual expression patterns, and the 
distribution of expression distances is skewed towards lower values. Our mapping procedure becomes 
less accurate over larger evolutionary distances, and the distribution of expression distances becomes 
less skewed. It becomes a uniform distribution when 1:1 orthologs cannot be mapped better than 
background gene pairs. Clades are numbered corresponding to the taxonomic tree in Fig. 6. 
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Fig. 6: Median conserved expression across animal clades. At each bifurcation, the pie chart 
denotes the median fraction of 1 :1 orthologs with expression distances below 0.25. The area 
corresponds to the number of 1 :1 orthologs across the taxonomic split. The numbers below the pie 
charts refer to the clade numbers in Fig. 5. 

The conservation of conservation 

In the previous section, we showed that there is an enrichment in conserved 
expression programs across most taxonomic splits. Here, we analyzed to what 
extent the conservation of expression programs (i.e. the expression distance 
between family members) depends on the gene family. If the rate of expression 
divergence is a property of the gene family we expect a correlation between the 
expression similarities for each family in different clades. In other words, a gene that 
has a conserved expression pattern in one clade should also have a conserved 
expression pattern in another clade. For each taxonomic split with two or more 
species on either side of the split, we calculated the median expression distance per 
gene family within each of the two clades. Out of six taxonomic splits with more 
than one species on both sides, we found significant Spearman correlations (r s ) of 
median expression similarities for three splits (Fig. 7A): between tetrapods and 
fishes (r s =0.18, #13 in Fig. 6), between protostomes and deuterostomes (r s =0.14, 
#4), and between nematodes and insects (r s =0.074, #7). Not significant were the 
splits involving cnidaria (#2), Schistosoma (#5) and spirurina (#10). Thus, expression 
distances were correlated across most taxonomic splits. 



The previous analysis was only possible for a subset of the taxonomic splits in our 
body of data, due to the requirement of having more than one species on either side 
of the split. We therefore also analyzed the fate of duplicated genes. In this case, we 
tested whether duplication products are more similar if the non-duplicated members 
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of the gene family have low expression distances across the species outside the 
duplication event. Indeed, we found significant negative correlations between the 
median expression distance among the non-duplicated genes and the intra-species 
correlation of the duplicated genes (Fig. 7B). For example, duplicated genes in fish 
were more similar (i.e. has a higher correlation) when the corresponding tetrapod 
genes had more similar expression patterns (i.e. had a low expression distance): 
r s =-0.1 1 for 1999 pairs of duplicated genes, corresponding to a p-value of 1e-7. 
Taken together, these two observations implied that for a significant fraction of 
genes, the rates of change in gene expression patterns were correlated between 
independently evolving clades. 

# clade r s p-value n 

2 eumetazoa 0.04 0.03 2478 

4 bilateria 0.14 0 4201 

5 protostomia 0.05 0.24 629 
7 ecdysozoa 0.07 1E-05 3467 
10 rhabditida 0.04 0.13 1458 
13 vertebrata 0.18 0 3391 

clade with clade with 
f duplicated gene single genes r s p-value n 

1 Salpingocea eumetazoa -0.14 0.08 164 
X deuterostomia protostomia -0.03 0.004 8618 
5 Schistosoma ecdysozoa -0.11 0.02 440 
5 ecdysozoa Schistosoma -0.10 0.02 560 
7 endopterygota rhabditida -0.07 0.02 1120 

2 Ciona vertebrata -0.39 0.03 29 

3 actinopterygii tetrapoda -0.12 1E-07 1999 
Fig. 7. Correlations between expression conservation rates. A For 1 :1 orthologs, the expression 
distance across taxonomic splits was compared. In the dataset, there were only six splits with at least 
two species on both sides of the split. For example, when genes were similar within tetrapods, they 
also tended to be similar within fishes. B The rate at which gene duplication products diverge was 
negatively correlated with the expression distance among single-copy genes in related species. Only 
correlations with p-values below 0.1 are shown in this table. (# - Number of clade in Fig. 6, n - count of 
1 :1 orthologs [A] or duplicated genes [B]) 

Evolution of the beta catenin protein family 

We have chosen the beta catenin protein family (Peifer et al. 1992) as an example to 
illustrate the implications of our work. Beta catenin proteins are involved in 
regulating cell adhesion and gene transcription through the Wnt signaling pathway. 
Ancestrally, there was a single beta catenin protein, which duplicated independently 
in the nematode and vertebrate lineages (Zhao et al. 201 1). Hence, Drosophila, 
Anopheles and Schistosoma have only one beta-catenin, armadillo. This protein was 
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similar in its expression patterns with both the vertebrate and nematode beta- 
catenins (Fig. 8), which is indicative of their functional similarities (White et al. 1998). 
In vertebrates, two forms exist: beta-catenin and plakoglobin. These two proteins 
have largely overlapping functions (Swope et al. 2013) and consequently, their 
observed expression distance was very low. In nematodes, the outcome of the 
repeated gene duplications (Liu et al. 2008; Korswagen et al. 2000; Natarajan et al. 
2001) is very different: three of the duplication products (hmp-2, wrm-1 , and sys-1) 
are very similar to each other in their expression patterns, which can be explained 
by their cooperation in in the non-canonical Wnt signaling pathway and the SYS 
pathway (Kidd et al. 2005). These three proteins showed a significant dissimilarity in 
their expression patterns compared to bar-1 . In contrast to them, bar-1 is part of a 
canonical Wnt signaling pathway (Kidd et al. 2005). We also observed that bar-1 had 
a low expression distance to the two vertebrate beta catenins, while hmp-2, wrm-1 , 
and sys-1 showed significant dissimilarity with plakoglobin. This example illustrates 
that our method is able to uncover patterns of expression similarity and divergence 
both between closely related species and across large evolutionary distances. 



Fig. 8. Expression similarity and divergence in the beta catenins. Based on the expression 
divergence scores of the individual proteins (Fig. S8), we computed p-values for the expression 
similarity and divergence of groups of proteins (see Methods): Blue cells indicate that the proteins have 
low expression divergence scores, and red cells that they have high expression divergence scores. 
Profiles are clustered based on their Pearson correlation. The unduplicated beta catenins from insects, 
flatworms and cnidarians a similar to all other protein groups, while functional and expression 
divergence has occurred among the nematodes. 




Caenorhabditis wrm-1 
Caenorhabditis sys-1 
nematode hmp-2 
insect/flatworm arm 
vertebrate CTNNB1 
cnidarian beta-catenin 
vertebrate JUP 
nematode bar-1 





similar divergent 
expression patterns 
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Resolving the fate of gene duplication products 

The above results suggested that our expression distance measure could be used 
to compare functions of gene duplication products (i.e. in-paralogs) across species 
(i.e. with their respective orthologs). To this end, we created an additional 
expression distance metric that combines measures for expression similarity and 
dissimilarity, which we term "expression divergence score." This allowed us to also 
test if two genes have significantly diverging expression patterns. As above, we 
used the p-value for the null hypothesis that the genes are not related to each other 
(p b ) to quantify expression similarity. To measure expression dissimilarity, we used 
the p-value for the null hypothesis that considered genes are in fact 1 :1 orthologs 
(p 0 ). We then combined the two p-values into an expression divergence score E: 
E = \~ logl ° Po Po - Pb 

I lo glO Pb Po > Pb 

Thus, the expression divergence score E is negative for similar, and positive for 
dissimilar gene pairs. Considering gene duplications, we computed divergence 
scores for both duplication products (Fig. 9). Using log 10 (0.25) as a cutoff, we 
divided pairs of duplication products into three categories: (a) both genes had 
conserved expression patterns (2226 pairs of duplication products), (b) both genes 
had diverging expression patterns (91 1 pairs) and (c) only one of the duplication 
products had a diverging expression pattern, while the other one was conserved 
(1206 pairs). For each duplication event we also computed the expression 
conservation of the respective non-duplicated orthologs among each other (Fig. S9). 
It turned out that, when expression distances among the non-duplicated orthologs 
were small, the respective duplication products were more likely to be both 
conserved (Fig. 10), supporting again that purifying selection acts across large 
phylogenetic distances. 
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Fig. 9. Expression divergence scores of duplication products. For each duplication event, the 
expression divergence score of the duplication products to the non-duplicated genes was computed. 
For each pair of duplication products, the expression divergence scores were sorted. Thus, in the 
lower left quadrant, both duplication products had conserved expression patterns. In the upper right, 
both duplication products had diverging expression patterns and in the lower right, the outcome was 
mixed. Black lines denote an expression divergence score cutoff of 0.25. 
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Fig. 10. Counts of duplication outcomes. Duplication events were grouped into quartiles according 
to the expression distance among the non-duplicated genes (Fig. S9). For each quartile, counts and 
fractions of the different outcomes are shown as the p-value cutoff is varied. This maximum p-value 
corresponds to a minimum of the absolute value of the expression divergence score, e.g. |E| > 
log 10 (0.25) for the black lines. 

Functional implications of diverging expression patterns in duplication products 
In order to independently validate the functional implications of the observed 
conservation of expression programs between duplication products, we utilized 
protein-protein interaction (PPI) data from the STRING database (Franceschini et al. 
2013). In particular, we investigated whether duplication products with expression 
programs more closely resembling non-duplicated orthologs are also functionally 
more related to the non-duplicated genes. Thus, for each pair of duplicated genes 
we designated the gene with the lower expression distance to non-duplicated 
orthologs as "less divergent" and the gene with the greater expression distance as 
"more divergent." 
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Fig. 11. Differences in the protein interaction network. (A) For all pairs of duplication products, we 
determined the number of interaction partners. Each dot corresponds to a pair of duplication products. 
Blue lines show linear fits for each dataset. (B) Here, each dot corresponds to the combination of a 
reference species (where the gene has not been duplicated) and a pair of duplication products. Blue 
lines correspond to linear fits for each combination of reference species and dataset. (C) The subset of 
duplication products is shown for which the outcome of the duplication could be determined. The 
difference between the two duplication products is largest for when only one of the duplication 
products diverges in its expression pattern. Here, the complete STRING network with confidence 
cutoff 0.5 is used. For an analysis of other networks and cutoffs, see Fig. S1 0, Fig. S1 1 , and Fig. S1 2. 



First, we found that less divergent genes had significantly more interaction partners 
than more divergent genes in 15 out of 25 datasets with available STRING data 
(using a p-value cutoff of 0.05 for one-sided Wilcoxon signed-rank tests, Fig. 1 1 A 
and Fig. S10). From this data, it remained unclear if the less divergent protein gained 
interaction partners, or if the more divergent protein lost interaction partners. 
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However, the latter hypothesis seemed more parsimonious to us: interaction 
partners are often tissue specific. Thus, if the diverged protein got expressed in 
different tissues it likely lost some of its former interaction partners. To further 
corroborate this notion, we compared the interaction partners of the duplication 
products with the interaction partners of the respective non-duplicated genes. 

We mapped PPI across species by counting how many OGs were shared between 
the interaction partners of the duplication product and the reference protein in a 
second species. We then calculated the Jaccard indices for the shared interaction 
partners between the reference protein and either of the duplication products (Fig. 
1 1 B). We found that for 40.5% of 395 dataset-species pairs, less divergent genes 
had a significantly higher Jaccard index compared to the more divergent genes (Fig. 
S1 1), while there were no dataset-species pairs where the more divergent proteins 
had significantly higher Jaccard indices. Furthermore, when we distinguished the 
possible outcomes discussed above (both duplication products have conserved 
expression patterns, one diverges, or both diverge), major differences only occurred 
for the case of one gene diverging (Fig. 1 1 C and Fig. S1 2): in this case, less 
divergent genes had higher Jaccard indices in 40.0% of 175 dataset-species pairs, 
and all other outcomes were much less prevalent (<7%). (To be able to compare the 
p-values between the different outcomes, the same number of duplication products 
were used for the significance tests by sub-sampling 1 00 times.) This observation is 
consistent with earlier findings about the tissue specificity of protein complexes 
(Bornigen et a\. 2013) and strengthens the notion that the duplication product with 
the more divergent expression pattern lost previous interactions and acquired novel 
interaction partners. 

Expression divergence depends on protein function 

In order to evaluate the extent to which the conservation of expression patterns after 
gene duplications depends on the function of gene products, we assigned protein 
classes according to the PANTHER database (Mi et al. 2013) to all OGs. We then 
determined the significance of the enrichment of gene duplication events where 
both genes were conserved (bottom left corner of Fig. 9) versus cases where one 
gene diverged (bottom right corner of Fig. 9, see Methods for more details). 
Extracellular matrix proteins, transcription factors, and receptors were significantly 
enriched for duplication events with conserved expression patterns (Fig. 12, top). 
Conversely, chaperones, calcium-binding proteins and transfer/carrier proteins are 
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enriched for duplication events with diverging expression patterns (Fig. 12, bottom), 
suggesting that regulatory proteins tend to be more conserved, whereas proteins 
executing tissue-specific functions are more likely to diverge. 
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Fig. 12. Gene duplication outcomes differ between protein classes. For each species and protein 
class, we analyzed duplications where one or both of the duplication products were conserved. Left: 
fraction of duplications where both duplication products had conserved expression patterns. Right: 
fractions normalized according to the species-wise average. Black bars correspond to the weighted 
average across all species. One-sided p-values have been calculated using a Poisson binomial 
distribution for each protein class, comparing the actual number of events to the expected number 
based on the species-wise averages. "Not assigned" marks OGs that could be mapped to PANTHER, 
but for which no protein class had been assigned in the PANTHER database. "Not mapped" stands for 
OGs for which no member could be mapped to the PANTHER database. Using an iterative method 
(see Methods) to remove multiple annotations per OG, we find a slightly different set of significant 
protein classes (Fig. S13). For example, nucleic acid binding proteins encompass transcription factors 
and are therefore enriched. 
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Discussion 

The presented analysis established and benchmarked a new method, and provided 
four biological conclusions: there is widespread conservation of expression 
regulation across very large evolutionary distances; independently evolving 
members of the same gene family have significant correlations in their speed of 
divergences; expression divergence can be used to grade the functional 
conservation of gene duplication products; and neofunctionalization of gene 
duplication products is dependent on gene function. 

In particular, we have shown that tissue-specific gene expression can be predicted 
across large evolutionary distances, even in the absence of apparent similarities 
between the species' tissues. Our approach can be rationalized as follows: we 
assume that evolution conserves the co-expression of functionally related genes, 
both on the level of homologous cell types and on the level of functional modules 
that occur in unrelated tissues. Our analysis demonstrated that the expression 
patterns of such conserved gene modules can be predicted across species using 
1 :1 orthologs as "anchors." This approach worked despite the fact that the tissues 
themselves are only conserved within smaller clades. Control of gene expression by 
transcription factors, miRNAs and other factors is known to turn over rather quickly 
(Odom et al. 2007; Bradley et al. 2010; Berezikov 2011). Most probably, functional 
dependencies between genes lead to shared expression patterns over large 
evolutionary distances. Further research will be needed to reveal which expression 
similarities between tissues are caused by homology and which are caused by 
convergent evolution. 

When we applied the concept of looking for correlations between orthologs across 
species to an existing dataset (Brawand et al. 201 1), we found that many of the 
reported lineage-specific expression shifts only changed the absolute expression 
levels, while the relative expression patterns remained conserved (Fig. S14). This 
suggests that further studies could combine approaches that test absolute and 
relative expression patterns to identify truly novel expression patterns. We 
investigated products of gene duplication events and found that they seem to have 
the ability to "opt out" of such gene expression modules to acquire new functions. 
Such events suggest unidirectional dependences: whereas the duplicated gene 
does not need (all of) its ancient interaction partners, the partners seem to need the 
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duplicated gene and, thus, one of the two remained in the respective expression 
module. A more detailed analysis, including the divergence on the level of the 
protein sequence and the mode of natural selection (such as positive or neutral 
selection) may lead to more connections between expression divergence, protein 
function and sequence evolution. 

Methods 

Import of expression data 

Datasets were obtained either from repositories like ArrayExpress and GEO, from 
supplementary materials or the respective websites of the resources. Expression 
profiles were then mapped to our set of genes by one of the following methods (see 
Table S1): If possible, genes were mapped by given identifiers, such as Affymetrix, 
Ensembl or WormBase identifiers. If identifiers could not be used for microarrays, 
we mapped probe sequences to transcripts using exonerate (Slater & Birney 2005), 
allowing for up to three mismatches and discarding probes that mapped to multiple 
genes. In the case of RNA-seq data without matching identifiers, we mapped reads 
to annotated transcripts using tophat2 and cufflinks 2.1.1 (Kim et al. 2013; Trapnell 
et al. 2010) and used the resulting FPKM counts. 

Normalization of expression data 

In initial small-scale tests, we tested several normalization methods (Liao & Zhang 
2006; Piasecka, Robinson-Rechavi, et al. 2012b), and settled on az-like 
normalization of expression vectors x, which corresponds to the Euclidean 
normalization of x minus its median value. Therefore, we did not look for conserved 
expression abundance, but rather for conserved relative expression across tissues. 
Normalizing each gene's expression individually also avoided technical concerns 
regarding the comparability of absolute expression values between genes. RNA-seq 
data, e.g. the Drosophila modENCODE dataset, contained zeros, which were of 
course not suitable for logarithmic analysis. For these datasets, we determined the 
expression value of the 1/1 000 th quantile of all genes with non-zero expression. All 
expression values were incremented by this value. 

Tissue correlations between species 

P-values for tissue correlations were calculated analytically. We performed tests 
with shuffling of genes to confirm that the analytical p-values correspond to 
empirical p-values. 
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Mapping of tissue expression patterns 

For each pair of datasets, individual linear models were trained for each tissue of the 
target species, using the tissues of the source species as input. (Note that due to 
the normalization, one tissue is redundant and therefore left out. This also implies 
that the coefficients of the linear model are not directly interpretable.) The set of 1:1 
orthologs between the two species was used as a training set. When there were 
multiple probes per gene, all combinations of probes were used for training. When 
there are many tissues in the source species, but few 1 :1 orthologs, there is the 
danger of over-fitting. We therefore allowed only one predictor (i.e. one tissue from 
the source species) per 15 samples (1 :1 orthologs) (Babyak 2004). For each pair of 
species, the safe number of predictors was calculated. If there were too many 
tissues, we combined tissues using /c-means clustering and used the centers of the 
clusters as predictors. This situation only occurred for six out of 992 dataset pairs. 
The trained models are then applied to all genes of the source species, yielding 
corresponding predicted expression patterns in the target species. Since 1 :1 
orthologs are used for training, we used predictions from a 10-fold cross-validation 
for these genes. 

Computation of expression distances 

For each pair of datasets, we computed a matrix of predicted expression patterns of 
all genes from the source species. We then calculated the weighted Pearson 
correlations between the mapped expression patterns and the actual expression 
patterns of the target species' genes. Weights on the tissues were calculated using 
the Gerstein-Sonnhammer-Chothia (GSC) weighting scheme to reduce the effect of 
uneven coverage of different anatomical regions (Gerstein et al. 1994). For example, 
in the mouse tissue datasets, there are many different brain tissues. Given the 
matrix of all weighted Pearson correlations, we then calculated expression distances 
like p-values, i.e. by determining the fraction of unrelated genes that have the same 
or higher correlation. For technical reasons, we sampled one million pairs of 
background genes, such that the lowest possible expression distance is 1e-6. 

As mentioned in the Results section, there is a strong correlation between the raw 
expression distances and the number of genes in the target species. This strong 
correlation indicated that predictions were biased towards the average target gene 
(i.e. the average expression profile of all genes considered in the target species), 
which in turn was similar to many target genes. As a consequence, these "close-to- 
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average" target genes had higher correlations with mapped source genes, and thus 
seemed more conserved. To counter this effect, target genes are split into ten bins 
according to the number of co-expressed genes in the target species. Thus, there 
exist ten conversion functions from weighted Pearson correlation to an uncorrected 
expression distance. For a given pair of genes, the final expression distance is 
interpolated from the two adjacent bins. We determined the number of co- 
expressed genes for each target gene as follows: we first computed all pairwise 
correlations among the target genes of the training set. Then, we determined the 
correlation cutoff corresponding to the top 10%, and counted for each gene how 
many other target genes were among the global top 10% correlations. 

Quantifying expression similarity and divergence for groups of genes 
Similarly to the definition of the expression divergence score E, we take two 
p-values into account for each pair of datasets: the p-value for the null hypothesis 
that the genes are not related to each other (p b ) and the p-value for the null 
hypothesis that considered genes are in fact 1 :1 orthologs (p 0 ). Given two groups of 
proteins, we then consider all interspecies combinations of datasets and compute 
paired Wilcoxon signed-rank tests to determine if the expression patterns are 
significantly similar (p b < p 0 ) or divergent (p b > p 0 ). For two groups of genes, we then 
report the lower p-value (Fig. 8). 

Analysis of protein classes 

Protein classes were obtained from the PANTHER 9.0 database (Mi et al. 2013) and 
mapped to OGs. For each combination of species and protein class, we then 
determined the fraction of duplication products where both products have 
conserved expression patterns, relative to the number of duplication products 
where at least one of the duplication products keeps a conserved expression 
pattern. For each species, there is thus a background frequency of duplication 
products that are both conserved. For each class, we then determined p-values for 
both over- and under-representation using a Poisson binomial distribution. This 
distribution is computed from the background frequencies and the number of 
duplications of the relevant class per species. As the protein classes overlap (e.g. 
kinases are also transferases), we also used an iterative method to determine p- 
values: after each iteration, all OGs that were annotated with the protein class with 
the most significant p-value were removed from the analysis (while keeping the 
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background frequencies constant). Thus, the new p-value for nucleic acids binding 
proteins excludes transcription factors, as these had a lower p-value (Fig. S13). 
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